A comparison of central‐tendency and interconnectivity approaches to clustering multivariate data with irregular structure

Abstract
Questions: Most clustering methods assume data are structured as discrete hyperspheroidal clusters to be evaluated by measures of central tendency. If vegetation data do not conform to this model, then vegetation data may be clustered incorrectly. What are the implications for cluster stability and evaluation if clusters are of irregular shape or density?
Location: Southeast Australia.
Methods: We define misplacement as the placement of a sample in a cluster other than that of its nearest neighbor and hypothesize that optimizing homogeneity incurs the cost of higher rates of misplacement. Chameleon is a graph‐theoretic algorithm that emphasizes interconnectivity and thus is sensitive to the shape and distribution of clusters. We contrasted its solutions with those of traditional nonhierarchical and hierarchical (agglomerative and divisive) approaches.
Results: Chameleon‐derived solutions had lower rates of misplacement and only marginally higher heterogeneity than those of k‐means in the range of 15–60 clusters, but their metrics converged with larger numbers of clusters. Solutions derived by agglomerative clustering had the best metrics (and divisive clustering the worst) but both produced inferior high‐level solutions to those of Chameleon by merging distantly‐related clusters.
Conclusions: Graph‐theoretic algorithms, such as Chameleon, have an advantage over traditional algorithms when data exhibit discontinuities and variable structure, typically producing more stable solutions (due to lower rates of misplacement) but scoring lower on traditional metrics of central tendency. Advantages are less obvious in the partitioning of data from continuous gradients; however, graph‐based partitioning protocols facilitate the hierarchical integration of solutions.


| INTRODUCTION
Vegetation classification is the process of delimiting types of vegetation on the basis of their relative homogeneity and distinctness from other types (Van Der Maarel & Franklin, 2013). Classification facilitates not only the description of vegetation but also the study of its relationships with the environment and attendant interacting, co-dependent organisms. Classification is thus the first step to the classification of ecosystems (sensu Tansley, 1935), and vegetation typologies have come to underpin a wide variety of conservation and natural resource management applications for terrestrial and coastal marine ecosystems including the selection of protected areas, ecosystem risk assessment and market-based mechanisms such as biodiversity offsets (Bland et al., 2019). Despite a relatively short history, vegetation science has spawned a wide range of traditions (sensu Van Der Maarel & Franklin, 2013; Whittaker, 1975). Increasingly, however, vegetation classification centers on the clustering of quantitative plot samples (De Cáceres et al., 2015). When recorded with systematic procedures, plot samples have the advantage of allowing observations from different sources to be consolidated over time, while computer-generated clustering solutions confer a degree of objectivity in the elucidation of patterns. The utility of clustering in the development of vegetation classifications is beyond question, although it is complicated by three inter-related problems. First, excepting simulated datasets, there is no agreed external point of reference with which clustering solutions can be compared. Instead, solutions based on field data must be evaluated on internal criteria (Aho et al., 2008), either geometric (e.g., cluster homogeneity) or nongeometric (e.g., species/cluster fidelity). Since these vary in the way they weigh particular characteristics of the solution, the best clustering solution may depend on its application.
Second, the hyperspatial structure of vegetation data is generally unknown. The choice of both clustering algorithm and evaluation metrics therefore requires a user-supplied model. This usually (but not invariably) assumes that clusters are spheroidal, or at least that it is appropriate to evaluate solutions based on within-cluster homogeneity or other measures of central tendency (Aho et al., 2008; Lengyel et al., 2021). This is problematic because algorithms that seek to optimize central tendency can generate sub-optimal solutions when applied to data with irregular structure, and internal metrics that assume a spheroidal model may not be appropriate measures of cluster quality. Third, biases in both the geographic and environmental distribution of samples mean that cluster metrics are often optimized for data that sample the range of floristic variation either unevenly or incompletely. That is, biases may induce irregularities in data structure even if assemblages in the field form a continuum. It is not surprising, then, that clustering solutions are notoriously idiosyncratic and highly sensitive to data structure, transformations, choices of algorithm, and resemblance measures (Tichý et al., 2014). This limits their robustness to new data, and hence their stability for policy and management applications.
The potential limitations of applying a spheroidal model to data of irregular structure are illustrated in Figure 1. The data are points on a Cartesian plane, normally and randomly distributed around each of six predefined centroids. The k-means algorithm fails to retrieve the underlying data structure: in (i) it incorrectly splits cluster C while merging clusters D and F; and in (ii) it incorrectly splits clusters C and F, which partially merge with clusters A and D, respectively. The resulting solutions appear what Barton et al. (2019) termed "unnatural," although they conceded the vagueness (sensu Regan et al., 2002) of circumscribing boundaries between clusters. Less subjectively, the solution is "incorrect," for example, in Figure 1i in assigning samples that are co-located in space in the region of centroid C to different groups, while drawing in remotely-located samples from the region of centroid A. The implication is that there is a high likelihood of alternative solutions arising as further data are added, or if the clustering algorithm is changed or supplied with different parameters.
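As an illustration only, simulated data of the kind described for Figure 1 could be generated along the following lines. The sample sizes are those listed for panel (i) of Figure 1 and the standard deviation is 1; the centroid coordinates, the seed, and the names `CENTROIDS`, `SIZES`, and `simulate` are assumptions introduced here, since the original coordinates are not reported in the text.

```python
import random

# Hypothetical centroid coordinates: the text does not report the originals.
CENTROIDS = {"A": (0.0, 0.0), "B": (6.0, 0.0), "C": (3.0, 5.0),
             "D": (10.0, 6.0), "E": (0.0, 10.0), "F": (12.0, 1.0)}
# Sample sizes for panel (i), taken from the Figure 1 caption.
SIZES = {"A": 30, "B": 50, "C": 500, "D": 50, "E": 70, "F": 300}

def simulate(centroids, sizes, sd=1.0, seed=0):
    """Draw points around each centroid with independent normal errors on x and y."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    points = []
    for label, (cx, cy) in centroids.items():
        for _ in range(sizes[label]):
            points.append((label, rng.gauss(cx, sd), rng.gauss(cy, sd)))
    return points

points = simulate(CENTROIDS, SIZES)
```

The strongly unequal sample sizes are what give the simulated clusters their uneven density.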
FIGURE 1 Sample clusters (A-F). Simulated data created by supplying Cartesian coordinates for six centroids and generating random deviations from the centroids as bivariate standard normal errors with sample sizes (i) n = 30, 50, 500, 50, 70, 300, and (ii) n = 20, 100, 500, 20, 100, 500, with standard deviation = 1. The boundaries of each cluster are approximated by circles; colors indicate cluster membership as determined by k-means operating on a matrix of Euclidean distances.
The problem illustrated in Figure 1 arises primarily from the insensitivity of the algorithm to variations in the density of points; however, a failure to recover "natural" or "correct" clusters of irregular shape has similarly been documented in a wide range of algorithms operating on assumptions of central tendency (Barton et al., 2019; Han et al., 2012; Karypis et al., 1999; Zhao & Karypis, 2005). The core principle underpinning algorithms which seek to retrieve clusters of irregular shape and/or density is sample interconnectivity.
That is, cluster membership depends on interconnections among samples (based on pairwise similarity), rather than shared proximity to an artificial centroid or medoid. Schmidtlein et al. (2010), for example, noted two vegetation samples with no species in common could nevertheless share cluster membership provided they were connected in a chain of close neighbors. This implies clusters generated by an algorithm sensitive to irregular data structure are likely to be more heterogeneous than those derived with reference to a spheroidal model, particularly where discontinuities and variations in sample density exist.
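The chaining idea can be made concrete with a small sketch. The four plots below are hypothetical (species coded as letters), and the similarity threshold of 0.5 is an arbitrary assumption. The first and last plots share no species, yet they are linked through a chain of close neighbors.

```python
from collections import deque

def sorensen(a, b):
    """Sørensen similarity between two non-empty species sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def connected(samples, start, goal, threshold):
    """Breadth-first search over links whose similarity is >= threshold."""
    seen, queue = {start}, deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(samples)):
            if j not in seen and sorensen(samples[i], samples[j]) >= threshold:
                seen.add(j)
                queue.append(j)
    return False

# Hypothetical plots spread along a compositional gradient.
plots = [set("abcd"), set("cdef"), set("efgh"), set("ghij")]
direct = sorensen(plots[0], plots[3])              # no shared species
chained = connected(plots, 0, 3, threshold=0.5)    # linked via intermediate plots
```

A centroid-based method would see only the zero direct similarity, whereas an interconnectivity-based method can follow the chain.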
Potential irregularities in the data structure are rarely accounted for in vegetation classification. Schmidtlein et al. (2010) documented a promising approach; however, our investigations of their ISOMAP algorithm suggested its "brute-force" approach is too computationally demanding for a dataset comprising many thousands of samples (Schmidtlein et al., 2010). Chameleon (Karypis et al., 1999; see Methods for a detailed description) is one of several alternative algorithms designed to recover clusters of variable shape, which may, therefore, reproduce landscape-scale relationships more faithfully than traditional clustering techniques (Han et al., 2012). Chameleon assesses both interconnectivity and closeness of objects as a basis for determining merging decisions, an approach that results in fewer "wrong" decisions than algorithms that consider only one or the other (Karypis et al., 1999). Focusing on interconnectivity allows the algorithm to adapt automatically to the characteristics of the clusters (density and hyperspatial distribution), rather than relying on a static model (e.g., discrete spherical clusters or degrees of compactness). Therefore, provided they are strongly interconnected, samples spanning a compositional continuum can be retrieved as a single cluster even if the distribution of samples along the continuum is uneven, because Chameleon is relatively insensitive to variations in hyperspatial density (Han et al., 2012).
We suggest that a failure to take account of the underlying structure of vegetation data is likely to be one factor contributing to idiosyncrasies among clustering solutions; however, the effect is likely to be dependent on the expression and nature of discontinuities in the data structure. We postulate that accounting for the data structure is more likely to be important at broad levels of classification (lower numbers of clusters, as represented by the points in Figure 1 collectively) because discontinuities are likely to arise naturally (e.g., between regions that share few species), due to variable data coverage (De Cáceres et al., 2018; Gellie et al., 2018), or because environmental gradients are discontinuous in geographic space (Austin, 2013). Conversely, there may be no disadvantage in assuming a spheroidal model where clustering essentially amounts to partitioning a continuum (i.e., partitioning the individual clusters in Figure 1).
In this paper, we investigate two hypotheses: (i) that an algorithm sensitive to hyperspatial irregularities in the density and arrangement of samples will produce clusters that are likely to be more "correct" (in the sense that samples are co-located with their close neighbors) but at the cost of poorer internal metrics relative to algorithms that seek to optimize around central tendency; and (ii) that differences between the respective algorithms will decline with the increasing number of clusters. To test these hypotheses, we used a large regional dataset of 7541 plot samples to compare the performance of traditional clustering algorithms (k-means, hierarchical agglomerative, and divisive) with the Chameleon algorithm. For this evaluation, we used both internal metrics (homogeneity, indicator species) and the concept of "correctness," which we apply as the misplacement rate.

2 | METHODS

| The Chameleon algorithm
Chameleon models the feature space as a k-nearest neighbor graph (sparse graph) with samples forming vertices connected by links that are proportional to pairwise similarity between samples (Figure 2). The user specifies the number of links between samples (neighborhood range), and then in the first phase, links are progressively dissolved (in order of increasing similarity) until a user-specified number of sub-partitions has formed. In this partitioning phase, the algorithm seeks to minimize the summed length of all dissolved links, hence minimizing the affinity between samples in different sub-partitions (Karypis et al., 1999).
FIGURE 2 Graphical representation of Chameleon's two-phase algorithm (reproduced from Karypis et al., 1999)
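The sparse graph underlying this description can be sketched as follows. This is a minimal illustration of a k-nearest neighbor graph built from a similarity matrix, not the CLUTO implementation; the function name `knn_graph`, the toy matrix, and the rule that a link is kept when either sample ranks the other among its k most similar are assumptions. The partitioning and merging phases of Chameleon are not shown.

```python
def knn_graph(sim, k):
    """Return {(i, j): weight} with i < j, linking each sample to its k most similar samples."""
    n = len(sim)
    edges = {}
    for i in range(n):
        # Rank all other samples by decreasing similarity to sample i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        for j in others[:k]:
            # Store each undirected link once, weighted by pairwise similarity.
            edges[(min(i, j), max(i, j))] = sim[i][j]
    return edges

# Toy symmetric similarity matrix for four samples (hypothetical values).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
graph = knn_graph(sim, k=1)
```

With k = 1 the toy graph already separates into two components; larger neighborhood sizes add weaker links between them.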

| Study area
The study area encompassed the South East Highlands and Australian Alps Bioregions of the state of NSW, Australia (Thackway & Creswell, 1995), an area of 96,089 km² encompassing mountains and pla-
Primary factors influencing the distribution of vegetation formations in our study area include temperature, rainfall, topography, soils, and drainage (Beadle, 1981; Costin, 1954; Jenny, 1983; Keith, 2004). Alpine assemblages are restricted to elevations more than 1830 m above sea level where winter temperature minima fall below the physiological tolerance of trees (Keith, 2004). Tree cover progressively increases with decreasing elevations as the severity of winter conditions declines. Sub-alpine grassy woodlands occupy the sub-alpine tracts, characteristically with short, gnarled trees and a large complement of cold-tolerant species also found in the alpine zone. On the southwest flank of the Great Divide, sub-alpine woodlands grade into tall wet sclerophyll forests, sustained by high orographic rainfall originating in south-westerly air flows. To the east, depending on soil lithology, texture, and fertility, sub-alpine woodlands grade into either Dry Sclerophyll Forest or Grassy Woodlands as annual rainfall declines in the shadow of the Divide. Grasslands replace Grassy Woodlands in frost hollows, the heaviest-textured soils, and the most moisture-limited sites (Costin, 1954). Further east of the tablelands, grasslands and grassy woodlands are replaced by mosaics of wet and dry sclerophyll forests on the escarpment ranges as rainfall increases with increasing elevation and exposure to oceanic weather systems (Keith, 2004). Wetlands occur throughout the bioregions in areas of impeded drainage, while heathlands are among the local expressions of edaphic and topographic variation.

| Compilation of floristic data
We sourced a total of 7541 floristic plot samples from a database presence-absence to eliminate the possible effects of bias in cover-abundance estimates by different observers. This transformation was considered an appropriate strategy to achieve a balance between information loss and maximizing the pool of available data in circumstances where the dataset is both large and likely to be heterogeneous (Goodall, 1978).

| Chameleon performance evaluation
We performed all Chameleon analyses on pairwise Bray-Curtis compositional similarities (also known as Sörensen(-Dice) index for presence-absence data) between samples (Clarke, 1993) using the scluster function in CLUTO software version 2.1.2 (Karypis, 2003).
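The equivalence noted above can be checked with a short sketch. For 0/1 data, the Bray-Curtis similarity and the Sørensen(-Dice) index reduce to the same quantity. The function names and the two example plots are hypothetical, and this is not the code used to prepare the input for CLUTO.

```python
def bray_curtis_similarity(x, y):
    """1 - sum|x_i - y_i| / sum(x_i + y_i) for two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return 1 - num / den

def sorensen_dice(x, y):
    """2a / (2a + b + c): a = shared species, b and c = species unique to each plot."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if q and not p)
    return 2 * a / (2 * a + b + c)

# Two hypothetical plots scored for five species (1 = present, 0 = absent).
plot_1 = [1, 1, 1, 0, 0]
plot_2 = [0, 1, 1, 1, 0]
bc = bray_curtis_similarity(plot_1, plot_2)
sd = sorensen_dice(plot_1, plot_2)
```

Both values equal 2/3 for this pair, up to floating-point rounding.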
First, since we found little information in the literature to guide parameter-setting, we assessed solutions of 15 clusters over a range of neighborhood sizes (15-1000 neighbors), degrees of sub-partitioning (up to 500 sub-partitions or agglomerative phase omitted), and linkage functions (single or complete) (Table 1). We carried out our initial trials using the single-link criterion function in the agglomerative phase, as recommended for nonspherical clusters (Karypis et al., 1999). For each solution, we calculated the average pairwise within-cluster similarity (homogeneity) and the proportion of samples located in clusters other than that of their nearest neighbor (misplacement rate). Specifying more than 30 sub-partitions caused extensive chaining (sensu Peet & Roberts, 2013).
We repeated the relevant trials using an option forcing Chameleon to prioritize large clusters over small ones in the partitioning phase as recommended to counter a tendency to chaining in a solution (Karypis et al., 1999). On the basis of the preliminary results, we undertook subsequent analyses using the complete linkage function and assessed performance over a range of cluster numbers (15-250 clusters) and degrees of sub-partitioning (30-500 sub-partitions) with neighborhood size fixed at either 30 or 1000 (Table 1).
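The two evaluation measures defined above can be sketched as follows. This is one plausible reading, not the authors' code: homogeneity is averaged over all within-cluster pairs (rather than per cluster and then across clusters), and ties for nearest neighbor go to the lowest-indexed sample. Both choices, the function names, and the toy matrix are assumptions.

```python
from itertools import combinations

def homogeneity(sim, labels):
    """Mean similarity over all pairs of samples that share a cluster."""
    pairs = [sim[i][j] for i, j in combinations(range(len(labels)), 2)
             if labels[i] == labels[j]]
    return sum(pairs) / len(pairs)

def misplacement_rate(sim, labels):
    """Proportion of samples whose nearest neighbor lies in a different cluster."""
    n = len(labels)
    misplaced = 0
    for i in range(n):
        # Nearest neighbor = most similar other sample (first one wins on ties).
        nearest = max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
        if labels[nearest] != labels[i]:
            misplaced += 1
    return misplaced / n

# Toy similarity matrix and two alternative two-cluster solutions (hypothetical).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = [0, 0, 1, 1]
poor = [0, 1, 1, 1]
```

In this toy case the solution that separates samples 0 and 1 has both lower homogeneity and a higher misplacement rate; in real data the two measures need not move together, which is the trade-off examined here.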

| Comparison of algorithms
A very wide range of algorithms has been applied to the classifica-  Aho et al., 2008; Lengyel et al., 2021). These algorithms vary both in complexity and in the extent they have been applied, but since they are generally applied with the expectation of retrieving compact clusters, we sought to compare our alternative approach with traditional, widely-understood approaches (Kent, 2011). We compared Chameleon cluster member sets with those derived from: (i) k-means clustering (Belbin, 1987; MacQueen, 1967); (ii) unweighted pair-group method with arithmetic means (Belbin et al., 1992); and (iii) polythetic division (MacNaughton-Smith et al., 1964; Belbin et al., 1984), all implemented using the PATN package (Belbin, 1987). We used each algorithm to compute solutions ranging from 15 to 250 clusters (Table 1) and characterized solutions in terms of homogeneity and misplacement rate (as described above), the number of species occurring at higher frequencies within each cluster than in the dataset as a whole (cumulative hypergeometric probability >0.999), and the number of species with standardized phi > 0.35 (Tichý & Chytrý, 2006).
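The two diagnostic-species criteria can be illustrated with a brief sketch. It computes a cumulative hypergeometric probability and the basic (unstandardized) phi coefficient of association. The standardization to equal group size described by Tichý & Chytrý (2006) is not implemented, and reading the cumulative probability as P(X <= observed) is an assumption, as are the function names and the example counts.

```python
from math import comb, sqrt

def hypergeom_cdf(observed, total_plots, species_plots, cluster_plots):
    """P(X <= observed), X = occurrences of the species in a random draw of cluster_plots plots."""
    favorable = sum(
        comb(species_plots, x) * comb(total_plots - species_plots, cluster_plots - x)
        for x in range(observed + 1)
    )
    return favorable / comb(total_plots, cluster_plots)

def phi(N, N_p, n, n_p):
    """Unstandardized phi: N plots in total, N_p in the cluster,
    n occurrences of the species overall, n_p of them within the cluster."""
    return (N * n_p - n * N_p) / sqrt(n * N_p * (N - n) * (N - N_p))

# Hypothetical species: present in 20 of 100 plots, and in 9 of the 10 plots of one cluster.
p_cum = hypergeom_cdf(observed=9, total_plots=100, species_plots=20, cluster_plots=10)
fidelity = phi(N=100, N_p=10, n=20, n_p=9)
```

In this made-up case the species would pass both thresholds (cumulative probability above 0.999 and phi above 0.35), although the phi value here is not standardized.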

| Comparing clustering solutions with a reference classification
We assessed the extent to which clustering solutions (15 classes) produced by each algorithm retrieved species sets characterizing the units of an established subcontinental-scale vegetation classification that covers 800,000 km² in southeastern Australia (Keith, 2004), including the study area (c. 11% of total area). The reference classification was developed from the top down based on an extensive review of vegetation studies, field reconnaissance, and qualitative synthesis of vegetation maps available at the time (Keith, 2004).

| RESULTS
A summary of the analytical trials performed and a brief synopsis of the results are contained in Table 1. A detailed description of the results follows. Trends in the misplacement rate and average within-cluster homogeneity in Chameleon cluster solutions generated using the single-link functions are summarized in Figure 3. The misplacement rate rose with increasing neighborhood size (Figure 3a). This result may reflect aberrations caused by forcing members of small clusters to forge links with samples in other clusters as illustrated
FIGURE 3 Misplacement rate (a) and average similarity among objects within clusters (b) as a function of neighborhood size. Results for solutions obtained with more than 30 sub-partitions are not shown in (a) because samples were frequently concentrated in a single cluster (chaining). Trials incorporating an agglomeration phase (a > 15) were performed using a single-linkage function.
or with the agglomeration phase omitted, were better (lower rates of misplacement and higher homogeneity) than those derived with the divisive algorithm but worse than those derived with the agglomerative algorithm (Figure 6). However, 15-class solutions derived by Chameleon were more even (i.e., the clusters had similar numbers of members) than those produced by either the agglomerative or divisive algorithms (Figure 7). Chameleon solutions were better than those of k-means in broad classifications (15-60 classes) but equivalent at finer classifications (90-250 classes). Chameleon produced more even 15-class solutions than k-means (Figure 7).
Clusters derived by Chameleon solutions were generally characterized by fewer diagnostic species than those derived using the traditional algorithms (Table 2); however, species diagnostic of Chameleon clusters corresponded more with those characterizing units of a reference classification for our study area than those diagnostic of clusters derived by agglomerative or divisive algorithms, both in the range of units represented and with less overlap between unrelated units (Tables 3-5). Clusters derived by k-means retrieved units of the reference classification with efficiency similar to Chameleon (Table 6).

| Performance of alternative clustering methods applied to irregular data structure
Overall, the results of our analyses support both of our hypotheses; graph-theoretic clustering produced less misplacement than central-tendency clustering, particularly for broad groupings.
Several caveats apply to this conclusion: (i) the utility of the different clustering methods cannot be encapsulated solely in terms of cluster
FIGURE 4 Clustering of simulated data (Figure 1) by Chameleon illustrating increasing rates of misplacement with increasing neighborhood size using the single-linkage function. Samples with the same color were placed in the same cluster.
infinite range of combinations. The clearest support for our hypotheses was evident in the comparison between solutions derived using
FIGURE 5 Trends in misplacement rate and within-cluster homogeneity with increasing neighborhood size or increasing number of sub-partitions in the agglomeration phase. The effect of increasing sub-partitions using the single-linkage function is not shown due to chaining, as described above. Trendlines are least-squares linear regressions. Data describing the respective 15-cluster solutions derived by k-means, agglomerative, and divisive algorithms are plotted for comparison (see Figure 6) (cl, complete linkage; sl, single-linkage).
Conversely, our three traditional algorithms scored equally highly in terms of the number of diagnostic species and clearly higher than the best Chameleon solutions, suggesting that unevenness in cluster membership numbers could, in fact, be symptomatic of biases in the distribution of samples among "natural" clusters, and that the three traditional algorithms performed better in detecting these uneven clusters (as evidenced by higher numbers of diagnostic species).

FIGURE 6 Trends in misplacement
Comparisons with a reference classification suggest unevenness in the cluster size is more likely to be indicative of chaining because the largest clusters were made up of samples representing multiple classes (as indicated by the range in diagnostic species), some of which are relatively distantly related. This phenomenon was most strongly evident in the agglomerative and divisive solutions (Tables 3-5). This reflects a well-known weakness of agglomerative or divisive methods, which incorporate merge or split decisions based on the aggregate properties of clusters. Such methods require either unrealistic assumptions concerning the structure of the data and/or sequential merge/split decisions, which cannot be reversed and which are necessarily sensitive to the composition of the dataset (Han et al., 2012).
While we did not evaluate the quality of solutions of greater than 15 classes, the agglomerative algorithm appeared to outperform all others in producing 250-cluster solutions with low rates of misplacement and high homogeneity, although its subsequent, upper-hierarchical groupings became progressively less meaningful because of poor merging decisions. We conclude that Chameleon and k-means generated the most informative solutions of 15 clusters with the former perhaps better representing the natural structure of the data while the latter produced more homogeneous groupings.
TABLE 2 Total number of diagnostic species across all classes as determined by frequency (statistically higher than background frequency) or standardized phi (Tichý & Chytrý, 2006)

| Are "natural" clusters necessarily less homogeneous?
Although our approach trades off cluster homogeneity for improvements in (mis)placement of samples in the cluster, the degree of trade-off is likely to depend on the structure of individual datasets.
In our case study, the misplacement rate achieved by Chameleon was half that of k-means at the cost of a 10% reduction in cluster homogeneity. If the clusters Chameleon retrieved in our dataset are indeed irregular shapes, then our results suggest they are unlikely to be highly elongated, and variability in our data structure tends toward uneven density rather than irregular shape.
The question of whether "natural" clusters necessarily have fewer diagnostic species is more difficult to resolve based on our analyses. A priori, we expect that more heterogeneous clusters would mean fewer diagnostic species, the pattern reflected in our results.
However, Schmidtlein et al. (2010) demonstrated that Isopam, an algorithm that adapts to irregular cluster shapes, consistently outperformed other algorithms in terms of the number of indicator species (sensu Dufrêne & Legendre, 1997) and was also highly ranked in terms of the number of species with standardized phi > 0.35 (Tichý & Chytrý, 2006).
In theory, algorithms sensitive to data structure may reduce the extent of this problem, at least at some level of data partitioning. Tozer et al. (2022) found that Chameleon's novel approach to modeling intersample relationships greatly facilitated the revision of an earlier broad classification of forested wetlands based on substantially fewer plot samples (Keith & Scott, 2005). Unlike many traditional methods, which incorporate merge or split decisions based on the aggregate properties of clusters, Chameleon operates on interconnected neighborhood sets. In Tozer et al.'s (2022) case, these were structured on the same similarity metric used in the original analysis.
They considered these features pivotal because the algorithm could potentially minimize the impact of adding new data by retaining connections between samples from the original set (Tozer et al., 2022). Tozer et al. (2022) reasoned that since Chameleon dissolves connections between relatively weakly-connected samples in the partitioning phases, strong pairwise relationships between samples underpinning clusters in the original analysis could be preserved (and reflected more faithfully) in their new Chameleon-derived clusters.
We note that there is some uncertainty in relation to how the algorithm can be best implemented. We employed the CLUTO clustering package (Karypis, 2003) distributed by Chameleon's authors; however, we noted some inconsistencies in relation to the parameters offered compared with the description of the algorithm (Karypis et al., 1999). Furthermore, Barton et al. (2019) have suggested that CLUTO's implementation does not entirely embody the Chameleon concept. Barton et al. (2019) reproduced the results of Karypis et al. (1999) and developed an alternative implementation, which demonstrates improved performance, although it relies on a different partitioning algorithm because the original is proprietary.

| CONCLUSION
Scale-dependent irregularities in vegetation data can affect the utility and stability of clustering solutions underlying vegetation classification schemes. The existence of clusters of irregular shape and density implies that novel metrics are required in their evaluation because such clusters may not score well on traditional metrics that assume a spheroidal model (Aho et al., 2008). Evaluating the utility of such cluster solutions requires metrics that assess interconnectivity rather than central tendency.
Although our results support the theoretical notion that graph-theoretic algorithms such as Chameleon are better suited to the task of elucidating vegetation classes, the trade-offs in its solutions and the ways in which these improve upon those retrieved by traditional clustering approaches require further quantification. We suggest this is a worthwhile endeavor because Chameleon offers a conceptually simple model, can process very large datasets quickly, and potentially presents a solution to the problem of integrating plot-based classifications across hierarchical levels.
Keith: Conceptualization (equal); project administration (equal); supervision (equal).
greatly improved by the insightful and constructive suggestions we received from our reviewers and editors.
Data were imported in a plain text file with n + 1 lines, the first line containing the number of rows, and the remaining n lines containing similarity values for each row (Karypis, 2003).
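A file with the layout described above could be produced along the following lines. The sketch follows only the description given here (a first line with the number of rows, then one line of similarity values per row). The space separator, the four-decimal precision, and the function name `format_similarity_file` are assumptions, and the CLUTO manual should be consulted for the exact format expected by scluster.

```python
def format_similarity_file(sim):
    """Return text with n + 1 lines: the row count, then one line of values per row."""
    lines = [str(len(sim))]
    for row in sim:
        # Space-separated values at fixed precision (both choices are assumptions).
        lines.append(" ".join(f"{value:.4f}" for value in row))
    return "\n".join(lines) + "\n"

# Hypothetical 3 x 3 similarity matrix.
sim = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.75],
    [0.25, 0.75, 1.0],
]
text = format_similarity_file(sim)
```

The returned string can then be written to disk with an ordinary `open(path, "w")` call.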